Machine Learning in Python

In this lesson you'll learn how to carry out some of the statistical modelling techniques you've learnt so far in Python. Python and R take rather different approaches to statistical modelling. R was designed for statistics and has built-in support for t-tests, linear models and so on. Python was not designed for statistics, and while it is possible to do all types of statistical tests in Python, it's not as straightforward.

Python's modelling tools are generally designed for machine learning. This means models are usually judged on predictive accuracy rather than interpretability. Python also has a focus on machine learning pipelines; this forces you to think more carefully about how your data is going to be processed.

With modern R and modern Python there are very few techniques that you can't do in both languages. The differences lie in what is easy to do, rather than what is possible.

Learning Objectives

Lesson Duration: 2 hours

Linear Regression

Exploring the data

To start we are going to see an example of building a linear model. The dataset we are going to use is "conspiracy_belief_score.csv". This dataset comes from the Open Psychometrics project. It has information on the conspiracy belief score for people who took an online test. We also have information on people's backgrounds, including their age, gender and where they live.

We are going to try to model the conspiracy belief score using some of the other variables available to us in the dataset.

Before we start building models we need to make sure we understand the data.

First we read the data in:

Then use head() to look at the first few rows.

And describe() to get some summary statistics.
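Sketching these three steps together (we don't have the real file here, so this uses a small hypothetical stand-in frame; the column names are assumptions, not necessarily the real file's columns):

```python
import pandas as pd

# In the lesson: conspiracy = pd.read_csv("conspiracy_belief_score.csv")
# Here we build a hypothetical stand-in so the example runs anywhere.
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "gender": ["male", "female", "female", "male", "female", "male"],
    "family_size": [1, 3, 2, 4, 2, 5],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9, 1.7, 4.2],
})

# first few rows
print(conspiracy.head())

# summary statistics for the numeric columns
print(conspiracy.describe())
```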

There is lots you could do to explore this data using pandas and seaborn, but we're going to cheat by using a Python package!

Install pandas profiling using the following command in the terminal.

conda install -c conda-forge pandas-profiling

This package has a function ProfileReport which has lots of plots and summaries that are particularly useful for doing modelling. Run this now.

Task - 5 minutes

Have a look at the report.

Do you see anything that needs to be cleaned?

Cleaning data

After you have explored the data you'll need to clean it. As you know from building models in R, there are lots of things you might need to do at this stage.

We have given you a nice clean dataset here so the only thing we need to do is create dummies.

Creating dummies

Unlike R, when you are working in Python you need to create dummies yourself. Luckily, there is a function from pandas that makes this easy.

We set drop_first = True because for a variable with four levels, you only need three dummy variables.
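For example, with a hypothetical four-level region column (an assumed column name, used only for illustration):

```python
import pandas as pd

# hypothetical stand-in with a four-level categorical column
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "region": ["north", "south", "east", "west"],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9],
})

# get_dummies creates one 0/1 column per level; drop_first=True drops one
# level to act as the baseline, leaving three dummies for four levels
conspiracy_dummies = pd.get_dummies(conspiracy, drop_first=True)
print(conspiracy_dummies.columns.tolist())
```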

Building the model

Now finally we can build our model!

To build all the models today we are going to use a package called scikit-learn. This package has a huge range of different models. Unlike R, we don't have any models built in to the language. Also, in R you generally have to download a new package for each new model you want to run. This is different in Python: almost every model you could ever need is inside scikit-learn.

You do not need to install anything, as scikit-learn comes as part of Anaconda.

Since scikit-learn is so large, we normally import the models one at a time.

The syntax for describing a model is different in Python compared to R. We need to make a data frame with all the variables we are using for predicting, and we need the variable we want to predict in its own array.

Then we define the model we want to use and fit it using the data we extracted.
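A sketch of those two steps, on hypothetical stand-in data (the column names are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical stand-in for the cleaned conspiracy data
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "family_size": [1, 3, 2, 4, 2, 5],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9, 1.7, 4.2],
})

# predictors in one data frame, target in its own array
X = conspiracy.drop("conspiracy_belief_score", axis=1)
y = conspiracy["conspiracy_belief_score"]

# define the model object, then fit it to the data
model = LinearRegression()
model.fit(X, y)
```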

Unlike R, we don't have a handy summary of the model. But we can pull out the R-squared value by using the score method.

We know the score method brings back R-squared for linear regression from the info in the LinearRegression() documentation here.

To get the coefficients, we need to do it in two parts. We get the coefficient for the intercept by looking at the intercept_ attribute.

And we get the rest of the coefficients using the coef_ attribute. These coefficients are returned in the same order as the variables appear in the data.
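Putting the score and coefficient steps together as a sketch (again on hypothetical stand-in data, rebuilt here so the example runs on its own):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical stand-in data
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "family_size": [1, 3, 2, 4, 2, 5],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9, 1.7, 4.2],
})
X = conspiracy.drop("conspiracy_belief_score", axis=1)
y = conspiracy["conspiracy_belief_score"]
model = LinearRegression().fit(X, y)

print(model.score(X, y))   # R-squared for linear regression
print(model.intercept_)    # the intercept coefficient
print(model.coef_)         # remaining coefficients, in the column order of X
```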

Task - 5 mins

  1. Interpret the R-squared value
  2. Interpret some of the coefficients. (Hint - think about how you could combine this array output with the column names to easily see which value corresponds to what variable in the data).

Solution

  1. This is a very low R-squared value. We're not doing a very good job of predicting conspiracy belief score!

  2. See below

For every extra member of your family your conspiracy belief score increases by 0.037 etc.
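One way to pair each coefficient with its column name, per the hint in the task (a sketch on hypothetical stand-in data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical stand-in data
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "family_size": [1, 3, 2, 4, 2, 5],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9, 1.7, 4.2],
})
X = conspiracy.drop("conspiracy_belief_score", axis=1)
y = conspiracy["conspiracy_belief_score"]
model = LinearRegression().fit(X, y)

# combine the coefficient array with the column names of X
coefficients = pd.DataFrame({"variable": X.columns, "coefficient": model.coef_})
print(coefficients)
```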

These are hard to interpret because we don't have the associated p-values. However, the statsmodels package provides an 'ordinary least squares' OLS() function with much the same information as lm() in R.

A weird wrinkle of this function is that it doesn't include the intercept term by default, but we can add it via the add_constant() function in the same package (this just adds a column const filled with value 1).

What about diagnostic plots? Well, unfortunately, we end up doing these manually, but it's no big deal. Let's create a residual vs fitted plot:
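A sketch using matplotlib (hypothetical stand-in data; the Agg backend is set so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical stand-in data
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "family_size": [1, 3, 2, 4, 2, 5],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9, 1.7, 4.2],
})
X = conspiracy.drop("conspiracy_belief_score", axis=1)
y = conspiracy["conspiracy_belief_score"]
model = LinearRegression().fit(X, y)

fitted = model.predict(X)
residuals = y - fitted

fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
fig.savefig("residuals_vs_fitted.png")
```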

And now let's create a normal-QQ plot. We can do this using the probplot() function in the scipy.stats module like this:
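A sketch of the QQ plot, again on hypothetical stand-in data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

# hypothetical stand-in data
conspiracy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "family_size": [1, 3, 2, 4, 2, 5],
    "conspiracy_belief_score": [2.1, 3.4, 2.8, 3.9, 1.7, 4.2],
})
X = conspiracy.drop("conspiracy_belief_score", axis=1)
y = conspiracy["conspiracy_belief_score"]
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# points lying near the straight line suggest approximately normal residuals
fig, ax = plt.subplots()
stats.probplot(residuals, dist="norm", plot=ax)
fig.savefig("qq_plot.png")
```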

Logistic Regression

Now let's see an example of logistic regression. The file loans.csv has real data from an online platform called Lending Club. Each row represents a loan given out on Lending Club. We have information on what the loan is for, the interest rate charged etc. We want to predict if a loan will be paid back.

Exploring the data

Let's start by reading in the data and having a look at using pandas-profiling.

Cleaning the data

There are a few things we need to do to clean this data. First, we are only interested in historical loan data, so let's filter the data.

We saw from the pandas profiling report that there were quite a few missing values here. Let's check whether there are still missing values in this subsetted dataset.

We've still got some missing values. The right thing to do would be to investigate why the values were missing and decide what action to take. Back when you learnt about the tidyverse we discussed all the ways of dealing with missing values. For this lesson, let's be lazy and just delete all missing values.

Let's check we have removed the missing values:
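Sketching the filter, the missing-value check and the lazy dropna together (a hypothetical stand-in for loans.csv; the assumption that live loans are marked "Current" in loan_status is ours, not the lesson's):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for loans.csv, with a couple of missing values
loans = pd.DataFrame({
    "int_rate": [10.5, np.nan, 7.9, 11.1, np.nan, 8.4],
    "loan_status": ["Fully Paid", "Charged Off", "Current",
                    "Fully Paid", "Charged Off", "Fully Paid"],
})

# keep only historical loans, i.e. ones with a known outcome
# (assumption: live loans are marked "Current" in this dataset)
loans = loans[loans["loan_status"] != "Current"].copy()

print(loans.isna().sum())   # missing values per column

loans = loans.dropna()      # the lazy option: drop every row with a missing value
print(loans.isna().sum())   # all zeros now
```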

We want to predict if a loan has been paid off, but we have several values for loan_status. We need to make a new column that just checks if loan status is equal to "Fully Paid", and return a value of 1 if so and 0 otherwise (since scikit-learn always needs numeric data). We can use the function where from numpy that we saw yesterday to do this.

Now let's drop the loan_status column:
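These two steps can be sketched like this (hypothetical stand-in data):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the filtered loans data
loans = pd.DataFrame({
    "int_rate": [10.5, 13.2, 7.9],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
})

# 1 if the loan was fully paid, 0 otherwise - scikit-learn needs numeric data
loans["paid"] = np.where(loans["loan_status"] == "Fully Paid", 1, 0)

# the original column is now redundant
loans = loans.drop("loan_status", axis=1)
```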

Again, since we need numeric data then we need dummy variables.

Now finally we can split our data into predictors and the target variable.

For this example, since we are more interested in making predictions than interpreting the model, let's split the data into a test and training set. We can do this using the function train_test_split from scikit-learn.

Here we have set the test_size to 0.1, to take 10% of the data as a test set. Specifying random_state is like using set.seed in R. This makes the random splitting of the data reproducible.

The code above uses Python "syntactic sugar". Since train_test_split returns a tuple, we can define several variables at once.
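Sketching the dummies, the predictor/target split and the train/test split together (on a randomly generated stand-in for the cleaned loans data; the columns and the relationship are made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the cleaned loans data
rng = np.random.default_rng(1)
loans = pd.DataFrame({
    "int_rate": rng.uniform(5, 20, 100),
    "grade": rng.choice(["A", "B", "C"], 100),
})
loans["paid"] = (loans["int_rate"] < 12).astype(int)  # made-up relationship

# dummy variables, then predictors and target
loans = pd.get_dummies(loans, drop_first=True)
X = loans.drop("paid", axis=1)
y = loans["paid"]

# train_test_split returns a tuple of four objects, unpacked in one line;
# random_state makes the split reproducible, like set.seed() in R
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
```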

Building the model

Now that we have the test and training set we can run our model. This time we are using logistic regression, so we need to import LogisticRegression, define our model object, then fit to the training data.

Now we can see the mean accuracy on the training data. Like with LinearRegression we know the metric returned from score() is accuracy from the documentation here.

And on the test data.
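Putting the fit and both accuracy checks together as a sketch (hypothetical stand-in data again, since we don't have loans.csv here; the printed accuracies depend entirely on the made-up data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the cleaned loans data
rng = np.random.default_rng(1)
loans = pd.DataFrame({
    "int_rate": rng.uniform(5, 20, 100),
    "grade": rng.choice(["A", "B", "C"], 100),
})
loans["paid"] = (loans["int_rate"] < 12).astype(int)  # made-up relationship
loans = pd.get_dummies(loans, drop_first=True)
X = loans.drop("paid", axis=1)
y = loans["paid"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# import the model, define the model object, fit to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

print(model.score(X_train, y_train))  # mean accuracy on the training data
print(model.score(X_test, y_test))    # mean accuracy on the test data
```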

We can also get other metrics such as the ROC curve and AUC. Recall that to calculate the ROC we need the predicted probabilities and the observed values in the data. First we return the predicted probabilities from our model:

We see we get an array with 2 entries for every row of the data - these are the probabilities for both outcomes (0 or 1) for the paid column - i.e. every row will add to 1. We just want the probabilities of the positive outcome - that the loan will be paid - which is the right hand column of the array:

Now we have the 2 pieces of information required to calculate our AUC score from the ROC. We use the roc_auc_score function to calculate this:
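A sketch of the probability and AUC steps (hypothetical stand-in data; stratify is our addition here, to guarantee both outcome classes appear in the small test set):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the cleaned loans data
rng = np.random.default_rng(1)
loans = pd.DataFrame({
    "int_rate": rng.uniform(5, 20, 100),
    "grade": rng.choice(["A", "B", "C"], 100),
})
loans["paid"] = (loans["int_rate"] < 12).astype(int)  # made-up relationship
loans = pd.get_dummies(loans, drop_first=True)
X = loans.drop("paid", axis=1)
y = loans["paid"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)

model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_test)  # one column per class; each row sums to 1
probs_paid = probs[:, 1]             # probability of the positive class (paid = 1)

auc = roc_auc_score(y_test, probs_paid)
print(auc)
```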

Decision Trees

Now we are going to see an example of using a different model on the same data.

You should be beginning to see a pattern in how to use a model from scikit-learn: first we import the model, then we define a model object and finally we fit the model to the data.

To stop the tree getting too complicated we are going to change some parameters on the model object. We are going to set max_features to 3, which means at most three features will be considered at each split. We will also limit the depth of the tree to 5. In reality we would tune these hyperparameters to work out their optimal values.

Again, we can get the mean accuracy for this model by using the score method.
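The familiar import/define/fit pattern for the tree, sketched on hypothetical stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hypothetical stand-in for the cleaned loans data
rng = np.random.default_rng(1)
loans = pd.DataFrame({
    "int_rate": rng.uniform(5, 20, 100),
    "grade": rng.choice(["A", "B", "C"], 100),
})
loans["paid"] = (loans["int_rate"] < 12).astype(int)  # made-up relationship
loans = pd.get_dummies(loans, drop_first=True)
X = loans.drop("paid", axis=1)
y = loans["paid"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# at most 3 features considered per split, at most 5 levels deep
model = DecisionTreeClassifier(max_features=3, max_depth=5, random_state=42)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # mean accuracy on the test set
```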

Random Forest

In the last part of this lesson we are going to cover a model that you have not seen before.

Random forest is a powerful, general-purpose classification algorithm. It is called random forest because it builds a collection of decision trees!

First we set the number of trees we want to build. Then we take a random subsample of the data and build a decision tree on it. We repeat this process for every tree, so each tree is built on its own subsample of the data. Now we have a big collection of decision trees.

If we want to make a prediction, first we make a prediction using every tree in the set, then we take the average answer. For example, if we have 10 trees and 3 of them guess that a loan will not be repaid and 7 guess that it will, then our answer will be a 70% chance that the loan is repaid.

This has a big advantage over decision trees because decision trees can be very sensitive to small changes in data. A slightly different dataset will give a totally different tree structure. We can "average" over the possible tree structures for the data by using a random forest.

This type of model building, where we average across several models, is often called "ensemble learning".

Let's see an example in Python. Again, we import the model, define the model object and finally fit it to our data.

Here we set n_estimators to be 10, because we are using 10 decision trees.
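A sketch of fitting the forest (hypothetical stand-in data as before):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the cleaned loans data
rng = np.random.default_rng(1)
loans = pd.DataFrame({
    "int_rate": rng.uniform(5, 20, 100),
    "grade": rng.choice(["A", "B", "C"], 100),
})
loans["paid"] = (loans["int_rate"] < 12).astype(int)  # made-up relationship
loans = pd.get_dummies(loans, drop_first=True)
X = loans.drop("paid", axis=1)
y = loans["paid"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# n_estimators=10 means the forest contains 10 decision trees
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
```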

Task - 5 minutes

Find the mean accuracy on the test data and compare it to our two other models.

Solution

There is randomness here, so your model will have slightly different results!
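One possible solution sketch, on the same kind of hypothetical stand-in data (the accuracy it prints is an artefact of the made-up data, not a real result):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the cleaned loans data
rng = np.random.default_rng(1)
loans = pd.DataFrame({
    "int_rate": rng.uniform(5, 20, 100),
    "grade": rng.choice(["A", "B", "C"], 100),
})
loans["paid"] = (loans["int_rate"] < 12).astype(int)  # made-up relationship
loans = pd.get_dummies(loans, drop_first=True)
X = loans.drop("paid", axis=1)
y = loans["paid"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_train, y_train)

# mean accuracy on the test data, to compare with the other two models
print(forest.score(X_test, y_test))
```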


Additional Resources

The patsy library will let you use formula notation in Python: https://patsy.readthedocs.io/en/latest/